Author: Hang He
GitHub Page: https://GavinHHE.github.io
Class: CMPS 3160
Dataset Source:
From CDC: PLACES: Local Data for Better Health, Census Tract Data 2020 release: https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-Census-Tract-D/cwsq-ngmh
From Kaggle: Cardio Vascular Disease Detection: https://www.kaggle.com/bhadaneeraj/cardio-vascular-disease-detection
Diabetes Health Indicators Dataset: https://www.kaggle.com/alexteboul/diabetes-health-indicators-dataset
Code Reference: https://plotly.com/python/choropleth-maps/
Introduction
ETL and EDA
Model Construction and Evaluation
Conclusion
According to an article on heart.org, nearly half of American adults have high blood pressure. Most of the time, high blood pressure (HBP, or hypertension) has no obvious symptoms to indicate that something is wrong. It develops slowly over time and can be related to many causes. For this project, I will dive deep into the health data from the CDC and the cardiovascular disease data from Kaggle through visualization and analysis. The final report will include a visualization of the percentage of the population that has HBP by state and an analysis of the importance of HBP as a risk factor for cardiovascular disease and diabetes. I will also include machine learning models that make predictions using the available data. Ideally, the models will be able to predict accurately whether a specific person has cardiovascular disease or diabetes.
All data can also be found in the repository.
The Census Tract Data 2020 release (2017 to 2018) contains the overall responses to surveys conducted by multiple organizations. Columns in the dataset include when and where the survey was conducted, the total population involved, a description of the question asked, and the response value as a percentage. I will use this dataset to visualize the HBP rate and perform some basic calculations.
Cardio Vascular Disease Detection contains data on people both with and without cardiovascular disease. It includes personal information such as age, gender, height, weight, and blood pressure measurements. The dataset also has columns indicating smoking, drinking, and exercise status. I will also assess those three risk factors in the machine learning part.
Diabetes Health Indicators Dataset contains data on people both with and without diabetes. Columns include whether the person smokes or drinks, age group, education level, income level, gender, and so on. There are no missing values in the dataset. Many of the columns are categorical or boolean variables.
None of the three datasets has missing values. Data from the Census Tract Data 2020 release are clean and do not need further cleaning. There are many unrealistic extreme values in Cardio Vascular Disease Detection. I will deal with those values by dropping or transforming them. Outliers in Census Tract Data are not removed in the EDA part, but I will transform or drop outliers when making predictions. Most outliers in Diabetes Health Indicators Dataset are removed in the EDA part.
The main question for my project is "How risky is HBP? Will it cause people to have cardiovascular disease or diabetes?" In addition, I would like to analyze other important risk factors related to these diseases.
Here I will load the Census Tract Data 2020 release. There is a description of the columns on the website: https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-Census-Tract-D/cwsq-ngmh. Since I am only interested in the survey results about blood pressure, I will keep only the relevant rows and select these columns: ['Year','StateAbbr','StateDesc','CountyName','Measure','Data_Value_Unit','Data_Value','Geolocation']
# ETL Process
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# import requests
# from bs4 import BeautifulSoup
# from IPython.core.display import display, HTML
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))
Health_df= pd.read_csv('../newProj/PLACES__Local_Data_for_Better_Health__Census_Tract_Data_2020_release.csv',low_memory=False)
## take rows that are related to HBP only
BloodPressure_df = Health_df[Health_df.Short_Question_Text=='High Blood Pressure'].copy()
BloodPressure_df.head()
| Year | StateAbbr | StateDesc | CountyName | CountyFIPS | LocationName | DataSource | Category | Measure | Data_Value_Unit | ... | Data_Value_Footnote | Low_Confidence_Limit | High_Confidence_Limit | TotalPopulation | Geolocation | LocationID | CategoryID | MeasureId | DataValueTypeID | Short_Question_Text | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3317 | 2017 | AL | Alabama | Crenshaw | 1041 | 1041963600 | BRFSS | Health Outcomes | High blood pressure among adults aged >=18 years | % | ... | NaN | 44.6 | 45.9 | 3180 | POINT (-86.36923855 31.72464193) | 1041963600 | HLTHOUT | BPHIGH | CrdPrv | High Blood Pressure |
| 3329 | 2017 | AL | Alabama | Lauderdale | 1077 | 1077011000 | BRFSS | Health Outcomes | High blood pressure among adults aged >=18 years | % | ... | NaN | 41.2 | 43.3 | 4612 | POINT (-87.6820897 34.82994416) | 1077011000 | HLTHOUT | BPHIGH | CrdPrv | High Blood Pressure |
| 3338 | 2017 | AL | Alabama | Franklin | 1059 | 1059972900 | BRFSS | Health Outcomes | High blood pressure among adults aged >=18 years | % | ... | NaN | 40.3 | 42.4 | 4008 | POINT (-87.61995937 34.52317217) | 1059972900 | HLTHOUT | BPHIGH | CrdPrv | High Blood Pressure |
| 3340 | 2017 | AL | Alabama | Jefferson | 1073 | 1073010803 | BRFSS | Health Outcomes | High blood pressure among adults aged >=18 years | % | ... | NaN | 36.0 | 37.8 | 6514 | POINT (-86.71445129 33.51402095) | 1073010803 | HLTHOUT | BPHIGH | CrdPrv | High Blood Pressure |
| 3363 | 2017 | AL | Alabama | Jefferson | 1073 | 1073010802 | BRFSS | Health Outcomes | High blood pressure among adults aged >=18 years | % | ... | NaN | 31.7 | 33.7 | 3448 | POINT (-86.76308889 33.48895376) | 1073010802 | HLTHOUT | BPHIGH | CrdPrv | High Blood Pressure |
5 rows × 23 columns
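Since only a handful of the 23 columns are used downstream, one option is to let `pd.read_csv` load just those columns through its `usecols` parameter. The sketch below is only an illustration: it reads a tiny in-memory CSV instead of the real file, and the sample rows and the names `sample_csv`, `wanted`, `small_df`, and `hbp_df` are made up for this example. `Short_Question_Text` is kept so the HBP rows can still be filtered.

```python
import io
import pandas as pd

# Tiny stand-in for the PLACES file; the values here are illustrative only.
sample_csv = io.StringIO(
    "Year,StateAbbr,Short_Question_Text,Data_Value,TotalPopulation\n"
    "2017,AL,High Blood Pressure,45.3,3180\n"
    "2017,AL,Obesity,38.0,3180\n"
    "2017,UT,High Blood Pressure,23.1,4100\n"
)

# Only these columns are parsed; TotalPopulation is skipped entirely.
wanted = ["Year", "StateAbbr", "Short_Question_Text", "Data_Value"]
small_df = pd.read_csv(sample_csv, usecols=wanted)

# Keep only the high-blood-pressure rows, as in the cell above.
hbp_df = small_df[small_df["Short_Question_Text"] == "High Blood Pressure"].copy()
print(hbp_df)
```

With the real file, the same `usecols` list could be passed alongside the path, which may reduce memory use for a file of this size.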
## remove columns that are not needed
col = ['Year','StateAbbr','StateDesc','CountyName','Measure','Data_Value_Unit','Data_Value','Geolocation']
BloodPressure_df=BloodPressure_df[col]
BloodPressure_df.reset_index(drop=True,inplace=True)
BloodPressure_df.head()
| Year | StateAbbr | StateDesc | CountyName | Measure | Data_Value_Unit | Data_Value | Geolocation | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2017 | AL | Alabama | Crenshaw | High blood pressure among adults aged >=18 years | % | 45.3 | POINT (-86.36923855 31.72464193) |
| 1 | 2017 | AL | Alabama | Lauderdale | High blood pressure among adults aged >=18 years | % | 42.3 | POINT (-87.6820897 34.82994416) |
| 2 | 2017 | AL | Alabama | Franklin | High blood pressure among adults aged >=18 years | % | 41.4 | POINT (-87.61995937 34.52317217) |
| 3 | 2017 | AL | Alabama | Jefferson | High blood pressure among adults aged >=18 years | % | 36.9 | POINT (-86.71445129 33.51402095) |
| 4 | 2017 | AL | Alabama | Jefferson | High blood pressure among adults aged >=18 years | % | 32.6 | POINT (-86.76308889 33.48895376) |
## Checking for missing values
BloodPressure_df.isna().value_counts()
Year StateAbbr StateDesc CountyName Measure Data_Value_Unit Data_Value Geolocation False False False False False False False False 72337 dtype: int64
## Survey results are recorded by county
## I want to know the mean HBP rate by states
Percent_mean_state = BloodPressure_df.groupby('StateAbbr').Data_Value.mean()
## Top 5 states with highest HBP rate
Percent_mean_state.sort_values(ascending=False)[:5]
StateAbbr WV 42.212190 AL 41.759660 MS 41.332219 LA 40.080249 KY 39.833544 Name: Data_Value, dtype: float64
## Top 5 states with lowest HBP rate
Percent_mean_state.sort_values(ascending=True)[:5]
StateAbbr UT 23.843590 CO 25.433736 MN 26.560495 CA 27.377464 MA 28.436500 Name: Data_Value, dtype: float64
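The state means above give every survey row the same weight, regardless of how many people live in the area it covers. As a hedged alternative, a population-weighted mean could be computed from `TotalPopulation`, which exists in `Health_df` but is not among the columns kept in `BloodPressure_df`. The sketch below uses made-up numbers in a small `toy` frame to show the arithmetic; it is not a result from the real data.

```python
import pandas as pd

# Illustrative values only: two areas per state with different populations.
toy = pd.DataFrame({
    "StateAbbr": ["AL", "AL", "UT", "UT"],
    "Data_Value": [40.0, 30.0, 20.0, 30.0],
    "TotalPopulation": [1000, 3000, 2000, 2000],
})

# Plain mean: each row counts equally.
unweighted = toy.groupby("StateAbbr")["Data_Value"].mean()

# Weighted mean = sum(value * population) / sum(population), per state.
toy["weighted_part"] = toy["Data_Value"] * toy["TotalPopulation"]
sums = toy.groupby("StateAbbr")[["weighted_part", "TotalPopulation"]].sum()
weighted = sums["weighted_part"] / sums["TotalPopulation"]

print(pd.DataFrame({"unweighted": unweighted, "weighted": weighted}))
```

In this toy example the two means differ for AL (35.0 versus 32.5) because the larger area has the lower rate, and coincide for UT where both areas have the same population.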
## Transform the pd Series into a DataFrame for further analysis
## reset_index keeps each state paired with its own mean, even if two states share the same value
Percentage_HBP_df = Percent_mean_state.rename_axis('State').reset_index(name='Mean_Percentage_HBP')
Percentage_HBP_df.head()
| State | Mean_Percentage_HBP | |
|---|---|---|
| 0 | AK | 30.044910 |
| 1 | AL | 41.759660 |
| 2 | AR | 39.295175 |
| 3 | AZ | 29.618470 |
| 4 | CA | 27.377464 |
# headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
# response = requests.get('https://developers.google.com/public-data/docs/canonical/states_csv', timeout=10, headers=headers)
# soup = BeautifulSoup(response.text, 'html.parser')
# geo=soup.find('table').get_text().split('\n\n\n')
# geo[1]
# State=[]
# Latitude = []
# Longitude =[]
# for var in geo[1:]:
# State.append(var.split('\n')[3])
# Latitude.append(var.split('\n')[1])
# Longitude.append(var.split('\n')[2])
# dict = {'State':State,'Latitude': Latitude, 'Longitude': Longitude}
# geo_df = pd.DataFrame(dict)
# Geo_Percentage_HBP_df = Percentage_HBP_df.merge(geo_df, on="State",how = 'left')
## Plot the average HBP rate by state
## reference: https://plotly.com/python/choropleth-maps/
import plotly.graph_objects as go
fig = go.Figure(data=go.Choropleth(
locations=Percentage_HBP_df['State'], # Spatial coordinates
z = Percentage_HBP_df['Mean_Percentage_HBP'].astype(float), # Data to be color-coded
locationmode = 'USA-states', # set of locations match entries in `locations`
colorscale = 'Reds',
colorbar_title = "Mean HBP rate",
))
fig.update_layout(
title_text = '2017 US High Blood Pressure rates by State',
geo_scope='usa', # limit map scope to USA
)
fig.show()
| Feature | Feature type | Column | Value type |
|---|---|---|---|
| Age | Objective Feature | age | int (days) |
| Height | Objective Feature | height | int (cm) |
| Weight | Objective Feature | weight | float (kg) |
| Gender | Objective Feature | gender | categorical code |
| Systolic blood pressure | Examination Feature | ap_hi | int |
| Diastolic blood pressure | Examination Feature | ap_lo | int |
| Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
| Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
| Smoking | Subjective Feature | smoke | binary |
| Alcohol intake | Subjective Feature | alco | binary |
| Physical activity | Subjective Feature | active | binary |
| Presence or absence of cardiovascular disease | Target Variable | cardio | binary |
Disease_df = pd.read_csv('../newProj/cardio_disease.csv', sep=';')
Disease_df
| id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 69995 | 99993 | 19240 | 2 | 168 | 76.0 | 120 | 80 | 1 | 1 | 1 | 0 | 1 | 0 |
| 69996 | 99995 | 22601 | 1 | 158 | 126.0 | 140 | 90 | 2 | 2 | 0 | 0 | 1 | 1 |
| 69997 | 99996 | 19066 | 2 | 183 | 105.0 | 180 | 90 | 3 | 1 | 0 | 1 | 0 | 1 |
| 69998 | 99998 | 22431 | 1 | 163 | 72.0 | 135 | 80 | 1 | 2 | 0 | 0 | 0 | 1 |
| 69999 | 99999 | 20540 | 1 | 170 | 72.0 | 120 | 80 | 2 | 1 | 0 | 0 | 1 | 0 |
70000 rows × 13 columns
## Checking null values for each column
Disease_df.isna().value_counts()
id age gender height weight ap_hi ap_lo cholesterol gluc smoke alco active cardio False False False False False False False False False False False False False 70000 dtype: int64
### I will also check for abnormal values in each column by looking at the minimum and maximum values
for var in Disease_df.columns.values:
    print('In column '+str(var)+
          ' Max values :'+str(Disease_df[var].max())+
          ' Min values :'+str(Disease_df[var].min()))
    print('------------------')
In column id Max values :99999 Min values :0 ------------------ In column age Max values :23713 Min values :10798 ------------------ In column gender Max values :2 Min values :1 ------------------ In column height Max values :250 Min values :55 ------------------ In column weight Max values :200.0 Min values :10.0 ------------------ In column ap_hi Max values :16020 Min values :-150 ------------------ In column ap_lo Max values :11000 Min values :-70 ------------------ In column cholesterol Max values :3 Min values :1 ------------------ In column gluc Max values :3 Min values :1 ------------------ In column smoke Max values :1 Min values :0 ------------------ In column alco Max values :1 Min values :0 ------------------ In column active Max values :1 Min values :0 ------------------ In column cardio Max values :1 Min values :0 ------------------
### Since negative values for ap_hi and ap_lo are not realistic, I will convert them to positive values
Disease_df.ap_hi = Disease_df.ap_hi.abs()
Disease_df.ap_lo = Disease_df.ap_lo.abs()
There is no explanation of the meaning of 1 and 2 in the gender column. After comparing the mean height and weight by gender, I inferred that gender value 2 is male and gender value 1 is female.
Disease_df.groupby('gender').height.mean()
gender 1 161.355612 2 169.947895 Name: height, dtype: float64
Disease_df.groupby('gender').weight.mean()
gender 1 72.565605 2 77.257307 Name: weight, dtype: float64
## Gender value 2 is male, gender value 1 is female
Disease_df.gender = Disease_df.gender.apply(lambda x: 'Female' if x==1 else 'Male')
According to the documentation, age is stored in days.
### Age is recorded in days; I will convert those values into years
Disease_df.age = Disease_df.age.apply(lambda x: round(x/365,1))
Looking at the box plots of height, weight, ap_hi, and ap_lo, there are many extreme values, some of which are unrealistic. I will remove the unrealistic values. The remaining outliers could be meaningful here, so I will transform them later.
Disease_df[['height','weight']].boxplot()
plt.show()
According to https://en.wikipedia.org/wiki/List_of_the_verified_shortest_people, the shortest recorded height is 54.6 cm. I will remove rows with a height lower than 55 cm.
Disease_df = Disease_df[Disease_df.height>=55]
Disease_df[['ap_hi','ap_lo']].boxplot()
plt.show()
I observed many outliers in both ap_hi and ap_lo. According to https://pubmed.ncbi.nlm.nih.gov/7741618/, the highest blood pressure recorded is 370/360 mmHg. I will remove rows with ap_hi or ap_lo higher than 360.
Disease_df = Disease_df[(Disease_df.ap_hi<=360)& (Disease_df.ap_lo<=360)]
It is unrealistic for a living person to have a diastolic pressure equal to or greater than the systolic pressure.
It is unrealistic for a living person to have a diastolic or systolic pressure below 50.
Disease_df = Disease_df[(Disease_df.ap_hi>=50) & (Disease_df.ap_lo>=50)]
Disease_df = Disease_df[Disease_df.ap_lo<Disease_df.ap_hi]
Disease_df.shape
(68659, 13)
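The blood-pressure rules applied above (both readings between 50 and 360, and diastolic strictly below systolic) can also be collected into one helper so that the thresholds live in a single place. This is a minimal sketch on made-up rows; `filter_plausible_bp` and its `low`/`high` parameters are hypothetical names introduced here, not part of the notebook.

```python
import pandas as pd

def filter_plausible_bp(df, low=50, high=360):
    """Keep rows whose ap_hi/ap_lo lie in [low, high] and where ap_lo < ap_hi."""
    mask = (
        df["ap_hi"].between(low, high)      # inclusive on both ends
        & df["ap_lo"].between(low, high)
        & (df["ap_lo"] < df["ap_hi"])       # diastolic must be below systolic
    )
    return df[mask].reset_index(drop=True)

# Illustrative rows: only the first and last satisfy all three rules.
toy = pd.DataFrame({
    "ap_hi": [120, 16020, 90, 40, 130],
    "ap_lo": [80, 90, 100, 30, 85],
})
clean = filter_plausible_bp(toy)
print(clean)
```

Applied to the notebook's frame, the call would be `filter_plausible_bp(Disease_df)`, assuming the negative readings have already been converted to positive values.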
## There are still many outliers in the dataset after removing extreme or unrealistic values
## I will deal with the remaining outliers by replacing them with boundaries calculated using 1.5 x the interquartile range
## reset index
Disease_df.reset_index(drop=True,inplace=True)
Disease_df.head()
| id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 50.4 | Male | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 55.4 | Female | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 2 | 51.7 | Female | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 3 | 48.3 | Male | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 4 | 47.9 | Female | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |
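The comments above mention replacing the remaining outliers with boundaries based on 1.5 times the interquartile range, but that step is not shown in this section. One common way to do it, offered here as an assumption rather than the notebook's actual later implementation, is to clip each numeric column to the interval [Q1 - 1.5·IQR, Q3 + 1.5·IQR]. `clip_iqr` is a hypothetical helper and the sample weights are made up.

```python
import pandas as pd

def clip_iqr(series, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the nearest bound."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Illustrative weights with one extreme value.
weights = pd.Series([60.0, 62.0, 64.0, 66.0, 68.0, 200.0])
clipped = clip_iqr(weights)
print(clipped.tolist())
```

Here Q1 is 62.5 and Q3 is 67.5, so the bounds are 55 and 75: the value 200 is pulled down to 75 while the other five values are unchanged. Unlike dropping rows, clipping keeps the row count the same.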
According to the documentation, gender, cholesterol, gluc, smoke, alco, active, and cardio are categorical variables.
## Check data type.
Disease_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 68659 entries, 0 to 68658 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 68659 non-null int64 1 age 68659 non-null float64 2 gender 68659 non-null object 3 height 68659 non-null int64 4 weight 68659 non-null float64 5 ap_hi 68659 non-null int64 6 ap_lo 68659 non-null int64 7 cholesterol 68659 non-null int64 8 gluc 68659 non-null int64 9 smoke 68659 non-null int64 10 alco 68659 non-null int64 11 active 68659 non-null int64 12 cardio 68659 non-null int64 dtypes: float64(2), int64(10), object(1) memory usage: 6.8+ MB
Values of 1, 2, and 3 are hard to interpret for the cholesterol and gluc columns. I mapped both columns according to the data specification provided.
Disease_df["cholesterol"]=Disease_df["cholesterol"].map({
1: "normal",
2: "above normal",
3: "well above normal",
})
Disease_df["gluc"]=Disease_df["gluc"].map({
1: "normal",
2: "above normal",
3: "well above normal",
})
Disease_dfCleaned = Disease_df.drop(columns='id').copy()
Disease_dfCleaned
| age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50.4 | Male | 168 | 62.0 | 110 | 80 | normal | normal | 0 | 0 | 1 | 0 |
| 1 | 55.4 | Female | 156 | 85.0 | 140 | 90 | well above normal | normal | 0 | 0 | 1 | 1 |
| 2 | 51.7 | Female | 165 | 64.0 | 130 | 70 | well above normal | normal | 0 | 0 | 0 | 1 |
| 3 | 48.3 | Male | 169 | 82.0 | 150 | 100 | normal | normal | 0 | 0 | 1 | 1 |
| 4 | 47.9 | Female | 156 | 56.0 | 100 | 60 | normal | normal | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 68654 | 52.7 | Male | 168 | 76.0 | 120 | 80 | normal | normal | 1 | 0 | 1 | 0 |
| 68655 | 61.9 | Female | 158 | 126.0 | 140 | 90 | above normal | above normal | 0 | 0 | 1 | 1 |
| 68656 | 52.2 | Male | 183 | 105.0 | 180 | 90 | well above normal | normal | 0 | 1 | 0 | 1 |
| 68657 | 61.5 | Female | 163 | 72.0 | 135 | 80 | normal | above normal | 0 | 0 | 0 | 1 |
| 68658 | 56.3 | Female | 170 | 72.0 | 120 | 80 | above normal | normal | 0 | 0 | 1 | 0 |
68659 rows × 12 columns
## transform data types
colList = ['gender','cholesterol','gluc','smoke','alco','active','cardio']
for var in colList:
    Disease_dfCleaned[var] = Disease_dfCleaned[var].astype('category')
Disease_dfCleaned.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 68659 entries, 0 to 68658 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 age 68659 non-null float64 1 gender 68659 non-null category 2 height 68659 non-null int64 3 weight 68659 non-null float64 4 ap_hi 68659 non-null int64 5 ap_lo 68659 non-null int64 6 cholesterol 68659 non-null category 7 gluc 68659 non-null category 8 smoke 68659 non-null category 9 alco 68659 non-null category 10 active 68659 non-null category 11 cardio 68659 non-null category dtypes: category(7), float64(2), int64(3) memory usage: 3.1 MB
figure = plt.figure(figsize=(15,5))
plt.subplot(1,5,1)
ax1 = plt.hist(Disease_dfCleaned['weight'], bins=30)
plt.title('Weight')
plt.subplot(1,5,2)
ax2 = plt.hist(Disease_dfCleaned['height'], bins=30)
plt.title('Height')
plt.subplot(1,5,3)
ax2 = plt.hist(Disease_dfCleaned['ap_lo'], bins=30)
plt.title('ap_lo')
plt.subplot(1,5,4)
ax2 = plt.hist(Disease_dfCleaned['ap_hi'], bins=30)
plt.title('ap_hi')
plt.subplot(1,5,5)
ax2 = plt.hist(Disease_dfCleaned['age'], bins=10)
plt.title('Age')
plt.show()
Looking at the distributions of these 5 columns, weight and height seem to follow a normal distribution. There are some outliers in weight and height, but there are too few of them to be visible on the graph. Age is positively skewed. It is hard to tell the distribution of ap_lo and ap_hi.
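A visual impression of skew can be cross-checked numerically with `Series.skew()`, which returns a positive value when the long tail is on the right and a negative value when it is on the left. The sketch below uses two made-up series only to show the sign convention; it says nothing about the actual columns.

```python
import pandas as pd

# Illustrative data: one series with a long right tail, one with a long left tail.
right_tailed = pd.Series([1, 1, 2, 2, 3, 10])
left_tailed = pd.Series([1, 8, 9, 9, 10, 10])

print(right_tailed.skew())  # positive for a long right tail
print(left_tailed.skew())   # negative for a long left tail
```

On the notebook's data, `Disease_dfCleaned[['weight','height','ap_lo','ap_hi','age']].skew()` would give one number per column to compare against the histograms.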
figure = plt.figure(figsize=(15,5))
plt.subplot(1,5,1)
ax1 = Disease_dfCleaned.groupby('cardio').weight.mean().plot.bar(color ='green')
plt.title('Mean Weight')
plt.subplot(1,5,2)
ax2 = Disease_dfCleaned.groupby('cardio').height.mean().plot.bar(color ='green')
plt.title('Mean Height')
plt.subplot(1,5,3)
ax3 = Disease_dfCleaned.groupby('cardio').ap_lo.mean().plot.bar(color ='green')
plt.title('Mean ap_lo')
plt.subplot(1,5,4)
ax4 = Disease_dfCleaned.groupby('cardio').ap_hi.mean().plot.bar(color ='green')
plt.title('Mean ap_hi')
plt.subplot(1,5,5)
ax5 = Disease_dfCleaned.groupby('cardio').age.mean().plot.bar(color ='green')
plt.title('Mean Age')
plt.show()
Here, I compared the mean values of those 5 columns by cardio status. The graphs show that people with cardiovascular disease are slightly older and have higher blood pressure measurements.
figure = plt.figure(figsize=(25,5))
plt.subplot(1,5,1)
ax1 = Disease_dfCleaned[Disease_dfCleaned.smoke==1].cardio.value_counts().plot.bar(color ='green')
plt.title('Cardiovascular diseases among smoker')
plt.subplot(1,5,2)
ax1 = Disease_dfCleaned[Disease_dfCleaned.alco==1].cardio.value_counts().plot.bar(color ='green')
plt.title('Cardiovascular diseases among alcohol use')
plt.show()
I visualized the number of people with and without cardiovascular disease among smokers and among drinkers. From the graphs, smoking and drinking appear to have no significant impact on cardiovascular disease.
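Counts within the smoker group alone do not show how smokers compare with non-smokers. One way to make that comparison explicit is a row-normalized cross-tabulation, which gives the share of each `cardio` value inside each `smoke` group. The sketch below runs on a small made-up frame, so the proportions are illustrative only.

```python
import pandas as pd

# Illustrative data: 4 non-smokers and 4 smokers.
toy = pd.DataFrame({
    "smoke":  [0, 0, 0, 0, 1, 1, 1, 1],
    "cardio": [0, 0, 1, 1, 0, 1, 1, 1],
})

# normalize="index" makes each row sum to 1:
# the share of cardio = 0 / cardio = 1 within each smoking group.
rates = pd.crosstab(toy["smoke"], toy["cardio"], normalize="index")
print(rates)
```

The same call with `Disease_dfCleaned["smoke"]` or `Disease_dfCleaned["alco"]` as the first argument would put the disease rates of the two groups side by side.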
sns.heatmap(Disease_df.corr(numeric_only=True), annot=True)
plt.gcf().set_size_inches(10,8)
I also created a heatmap to show the correlations between variables. The disease indicator, cardio, has a high correlation with ap_hi and ap_lo. Age and weight are also correlated with cardio.
diabetes_df= pd.read_csv('../newProj/diabetes_5050split_health_BRFSS2015.csv')
diabetes_df
| Diabetes_binary | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | ... | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 1.0 | 26.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 1.0 | 0.0 | 3.0 | 5.0 | 30.0 | 0.0 | 1.0 | 4.0 | 6.0 | 8.0 |
| 1 | 0.0 | 1.0 | 1.0 | 1.0 | 26.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 1.0 | 12.0 | 6.0 | 8.0 |
| 2 | 0.0 | 0.0 | 0.0 | 1.0 | 26.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 10.0 | 0.0 | 1.0 | 13.0 | 6.0 | 8.0 |
| 3 | 0.0 | 1.0 | 1.0 | 1.0 | 28.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 | 1.0 | 11.0 | 6.0 | 8.0 |
| 4 | 0.0 | 0.0 | 0.0 | 1.0 | 29.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 5.0 | 8.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 70687 | 1.0 | 0.0 | 1.0 | 1.0 | 37.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 4.0 | 1.0 |
| 70688 | 1.0 | 0.0 | 1.0 | 1.0 | 29.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 1.0 | 1.0 | 10.0 | 3.0 | 6.0 |
| 70689 | 1.0 | 1.0 | 1.0 | 1.0 | 25.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 5.0 | 15.0 | 0.0 | 1.0 | 0.0 | 13.0 | 6.0 | 4.0 |
| 70690 | 1.0 | 1.0 | 1.0 | 1.0 | 18.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 4.0 | 0.0 | 0.0 | 1.0 | 0.0 | 11.0 | 2.0 | 4.0 |
| 70691 | 1.0 | 1.0 | 1.0 | 1.0 | 25.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 | 6.0 | 2.0 |
70692 rows × 22 columns
## Checking null values for each column
diabetes_df.isna().info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 70692 entries, 0 to 70691 Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Diabetes_binary 70692 non-null bool 1 HighBP 70692 non-null bool 2 HighChol 70692 non-null bool 3 CholCheck 70692 non-null bool 4 BMI 70692 non-null bool 5 Smoker 70692 non-null bool 6 Stroke 70692 non-null bool 7 HeartDiseaseorAttack 70692 non-null bool 8 PhysActivity 70692 non-null bool 9 Fruits 70692 non-null bool 10 Veggies 70692 non-null bool 11 HvyAlcoholConsump 70692 non-null bool 12 AnyHealthcare 70692 non-null bool 13 NoDocbcCost 70692 non-null bool 14 GenHlth 70692 non-null bool 15 MentHlth 70692 non-null bool 16 PhysHlth 70692 non-null bool 17 DiffWalk 70692 non-null bool 18 Sex 70692 non-null bool 19 Age 70692 non-null bool 20 Education 70692 non-null bool 21 Income 70692 non-null bool dtypes: bool(22) memory usage: 1.5 MB
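Note that `isna()` returns a frame of booleans, so calling `.info()` on it reports the non-null count of that boolean mask, which is always the full row count. A more direct check, sketched here on a made-up frame with one missing value, is to count the `True` entries per column with `isna().sum()`.

```python
import numpy as np
import pandas as pd

# Illustrative frame with a single missing BMI value.
toy = pd.DataFrame({"BMI": [26.0, np.nan, 29.0], "HighBP": [1.0, 0.0, 1.0]})

missing_per_column = toy.isna().sum()   # number of NaN values in each column
any_missing = toy.isna().any().any()    # True if any cell in the frame is missing
print(missing_per_column)
print(any_missing)
```

For the real data, `diabetes_df.isna().sum()` should show zeros in every column if the dataset has no missing values, as stated earlier.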
Convert the data types according to the column descriptions provided by the data uploader
col = ['MentHlth','PhysHlth','BMI']
for var in col:
    diabetes_df[var]=diabetes_df[var].astype('int')
col = ['Age','Education','Income','Sex','GenHlth']
for var in col:
    diabetes_df[var]=diabetes_df[var].astype('category')
diabetes_df
| Diabetes_binary | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | ... | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 1.0 | 26 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 1.0 | 0.0 | 3.0 | 5 | 30 | 0.0 | 1.0 | 4.0 | 6.0 | 8.0 |
| 1 | 0.0 | 1.0 | 1.0 | 1.0 | 26 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 3.0 | 0 | 0 | 0.0 | 1.0 | 12.0 | 6.0 | 8.0 |
| 2 | 0.0 | 0.0 | 0.0 | 1.0 | 26 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 1.0 | 0 | 10 | 0.0 | 1.0 | 13.0 | 6.0 | 8.0 |
| 3 | 0.0 | 1.0 | 1.0 | 1.0 | 28 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 3.0 | 0 | 3 | 0.0 | 1.0 | 11.0 | 6.0 | 8.0 |
| 4 | 0.0 | 0.0 | 0.0 | 1.0 | 29 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 0 | 0 | 0.0 | 0.0 | 8.0 | 5.0 | 8.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 70687 | 1.0 | 0.0 | 1.0 | 1.0 | 37 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 4.0 | 0 | 0 | 0.0 | 0.0 | 6.0 | 4.0 | 1.0 |
| 70688 | 1.0 | 0.0 | 1.0 | 1.0 | 29 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 0 | 0 | 1.0 | 1.0 | 10.0 | 3.0 | 6.0 |
| 70689 | 1.0 | 1.0 | 1.0 | 1.0 | 25 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 5.0 | 15 | 0 | 1.0 | 0.0 | 13.0 | 6.0 | 4.0 |
| 70690 | 1.0 | 1.0 | 1.0 | 1.0 | 18 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 4.0 | 0 | 0 | 1.0 | 0.0 | 11.0 | 2.0 | 4.0 |
| 70691 | 1.0 | 1.0 | 1.0 | 1.0 | 25 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 0 | 0 | 0.0 | 0.0 | 9.0 | 6.0 | 2.0 |
70692 rows × 22 columns
Since most of the columns are either categorical or boolean variables, I will check the distribution of BMI only
diabetes_df[['BMI']].boxplot()
plt.show()
People rarely have a BMI greater than 50, so I will drop rows with a BMI higher than 50.
diabetes_df=diabetes_df[diabetes_df.BMI<=50]
Only a few outliers remain, so their potential effect should be small.
diabetes_df[['BMI']].boxplot()
<AxesSubplot:>
figure = plt.figure(figsize=(18,5))
plt.subplot(1,5,1)
ax1 = plt.hist(diabetes_df['BMI'], bins=20)
plt.title('BMI')
plt.subplot(1,5,2)
ax2 = diabetes_df.groupby('Diabetes_binary').BMI.mean().plot.bar(color ='green')
plt.title('Mean BMI')
Text(0.5, 1.0, 'Mean BMI')
The BMI data seem to follow a normal distribution. From the other graph, I observed that people with diabetes tend to have a higher BMI.
figure = plt.figure(figsize=(18,5))
plt.subplot(1,5,1)
ax2 = diabetes_df[diabetes_df.Smoker==1].Diabetes_binary.value_counts().plot.bar(color ='green')
plt.title('diabetes among smoker')
plt.subplot(1,5,2)
ax2 = diabetes_df[diabetes_df.HvyAlcoholConsump==1].Diabetes_binary.value_counts().plot.bar(color ='green')
plt.title('diabetes among HvyAlcoholConsump')
plt.subplot(1,5,3)
ax2 = diabetes_df[diabetes_df.HighBP==1].Diabetes_binary.value_counts().plot.bar(color ='green')
plt.title('diabetes among High BP')
plt.show()
Among people who smoke, there is a higher chance of having already been diagnosed with diabetes. Among people who have high blood pressure, there is a significantly higher chance of having already been diagnosed with diabetes.
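To put a number on "higher chance", one option is an odds ratio computed from the 2x2 table of `HighBP` against `Diabetes_binary`. The file name suggests the sample was balanced to a 50/50 split, in which case the within-group rates do not reflect population prevalence; the odds ratio is a summary commonly used in that setting. The sketch below works on a small made-up frame, so the value it prints is illustrative and not a finding.

```python
import pandas as pd

# Illustrative data: 4 people with high blood pressure, 4 without.
toy = pd.DataFrame({
    "HighBP":          [1, 1, 1, 1, 0, 0, 0, 0],
    "Diabetes_binary": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Rows: HighBP (0/1); columns: Diabetes_binary (0/1); cells: counts.
table = pd.crosstab(toy["HighBP"], toy["Diabetes_binary"])

# Odds = (with diabetes) / (without diabetes) inside each HighBP group.
odds_exposed = table.loc[1, 1] / table.loc[1, 0]
odds_unexposed = table.loc[0, 1] / table.loc[0, 0]
odds_ratio = odds_exposed / odds_unexposed
print(odds_ratio)
```

In the toy table the odds are 3/1 with high blood pressure and 1/3 without, giving an odds ratio of 9. An odds ratio above 1 points to an association, not to causation.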
sns.heatmap(diabetes_df.corr())
plt.gcf().set_size_inches(12,11)
I observed that the correlations between Diabetes_binary and both HighBP and BMI are relatively high. There are also high correlations between PhysHlth and DiffWalk and between MentHlth and PhysHlth. I will remove PhysHlth when applying a classification model.
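Reading strongly correlated pairs off a heatmap without annotations can be error-prone. A small helper can list them instead. The sketch below is an assumption about how one might do this, not the notebook's method: `high_corr_pairs` and its `threshold` parameter are hypothetical, and the data are made up so that only one pair is correlated.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.5):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],   # exactly 2 * a
    "c": [1.0, -1.0, 1.0, -1.0, 1.0],  # zero correlation with a and b
})
print(high_corr_pairs(toy))
```

Only the pair (`a`, `b`) is reported for the toy frame. On the real data, a call such as `high_corr_pairs(diabetes_df.select_dtypes('number'))` would list candidates like PhysHlth and DiffWalk; the threshold is a judgment call.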